Search CORE

37 research outputs found

TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism

Author: A Duran
A Rico
B Chapman
C Augonnet
D Chasapis
E Ayguadé
J Reinders
R Dennard
Robert D. Blumofe
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 29/05/2018
Field of study

As chip multi-processors (CMPs) become more complex, software solutions such as parallel programming models are attracting a lot of attention. Task-based parallel programming models offer an appealing approach to utilizing complex CMPs. However, the increasing number of cores on modern CMPs is pushing research towards fine-grained parallelism. Task-based programming models need to handle such workloads while offering performance and scalability. Using specialized hardware to boost the performance of task-based programming models is common practice in the research community. Our paper makes the observation that task creation becomes a bottleneck when fine-grained parallel applications are executed with many task-based programming models. As the number of cores increases, the time spent generating the tasks of the application becomes more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX offers a solution for minimizing task creation overheads and relies on both the runtime system and dedicated hardware. On the runtime system side, TaskGenX decouples task creation from the other runtime activities. It then transfers this part of the runtime to specialized hardware. We draw up the requirements for this hardware in order to boost the execution of highly parallel applications. From our evaluation using 11 parallel workloads on both symmetric and asymmetric multicore systems, we obtain performance improvements of up to 15×, averaging 3.1× over the baseline.
This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme under grant agreements No. 671697 and No. 779877. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Finally, the authors would like to thank Thomas Grass for his valuable help with the simulator.
Peer Reviewed. Postprint (author's final draft).

Crossref

UPCommons. Portal del coneixement obert de la UPC

Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models

Author: A. Peymandoust
Alastair R. Beresford
Andreas Gal
Albert Noll
Bram Adams
Bratin Saha
Carl Hewitt
Charles Antony Richard Hoare
Charles R. Johns
Chen-Yong Cher
Colin Blundell
David Ungar
David Wentzlaff
Doug Lea
ECMA International
Edward A. Lee
Freescale Semiconductor
Georg Sorst
Gul Agha
Hans Schippers
Haris Volos
Intel Corporation
James Gosling
Jim Gray
John A. Trono
John S. Danaher
John Zigman
José M. Piquer
Kevin Casey
Kevin Williams
Larry Seiler
Lukasz Ziarek
M. Anton Ertl
Mark S. Miller
Maurice Herlihy
Michael Haupt
Michael R. Marty
Nir Shavit
Pascal Costanza
Philipp Haller
Rajesh K. Karmani
Robert D. Blumofe
Robert Virding
Simon Gay
Sriram Srinivasan
Stefan Marr
Stijn Timbermont
Theo D'Hondt
Thomas Kistler
Tom Van Cutsem
Uwe Kastens
Vijay A. Saraswat
Virendra J. Marathe
Wenzhang Zhu
Wolfgang De Meuter
Xu Wang
Yaoqing Gao
Publication venue: 'Open Publishing Association'
Publication date: 01/02/2010

The upcoming many-core architectures require software developers to exploit concurrency to utilize available computational power. Today's high-level language virtual machines (VMs), which are a cornerstone of software development, do not provide sufficient abstraction for concurrency concepts. We analyze concrete and abstract concurrency models and identify the challenges they impose for VMs. To provide sufficient concurrency support in VMs, we propose to integrate concurrency operations into VM instruction sets. Since there will always be VMs optimized for special purposes, our goal is to develop a methodology to design instruction sets with concurrency support. Therefore, we also propose a list of trade-offs that have to be investigated to guide the design of such instruction sets. As a first experiment, we implemented one instruction set extension for shared memory and one for non-shared memory concurrency. From our experimental results, we derived a list of requirements for a full-grown experimental environment for further research.

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

Kent Academic Repository

Optimal (Randomized) Parallel Algorithms in the Binary-Forking Model

Author: Acar U. A.
Acar Umut A.
Agrawal Kunal
Akhremtsev Yaroslav
Arora N. S.
Ben-David Naama
Blelloch Guy E.
Blumofe Robert D.
Cole Richard
Dhulipala Laxman
Gil J.
Goodrich Michael T.
Gustedt Jens
Guy
Miller G.L.
Nievergelt Jürg
Rajasekaran S.
Valiant L. G.
Vishkin Uzi
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/06/2020

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference. In the binary-forking model, tasks can only fork into two child tasks, but can do so recursively and asynchronously. The tasks share memory, supporting reads, writes and test-and-sets. Costs are measured in terms of work (total number of instructions), and span (longest dependence chain). The binary-forking model is meant to capture both algorithm performance and algorithm-design considerations on many existing multithreaded languages, which are also asynchronous and rely on binary forks either explicitly or under the covers. In contrast to the widely studied PRAM model, it does not assume arbitrary-way forks nor synchronous operations, both of which are hard to implement in modern hardware. While optimal PRAM algorithms are known for the problems studied herein, it turns out that arbitrary-way forking and strict synchronization are powerful, if unrealistic, capabilities. Natural simulations of these PRAM algorithms in the binary-forking model (i.e., implementations in existing parallel languages) incur an $\Omega(\log n)$ overhead in span. This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony. All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model. Most of the algorithms are simple. Many are randomized.
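As a rough illustration of the cost measures named in this abstract, the sketch below sums a list by recursively splitting it into two child tasks and tracks work and span as it goes. It is not taken from the paper: the names `Cost` and `fork_join_sum` are hypothetical, each leaf and each combine step is assumed to cost one unit, and the two "forked" children are simply evaluated one after the other, so only the accounting reflects the model.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    value: int  # result of the computation
    work: int   # total number of unit operations performed
    span: int   # length of the longest chain of dependent operations


def fork_join_sum(xs, lo=0, hi=None):
    """Sum xs[lo:hi] using only two-way forks, returning the sum and its costs.

    Assumption for illustration: every leaf and every combine step costs 1.
    """
    if hi is None:
        hi = len(xs)
    n = hi - lo
    if n == 0:
        return Cost(0, 1, 1)
    if n == 1:
        return Cost(xs[lo], 1, 1)
    mid = lo + n // 2
    # Binary fork: in a real runtime these two calls could run in parallel.
    left = fork_join_sum(xs, lo, mid)
    right = fork_join_sum(xs, mid, hi)
    # Join: work adds across children, span follows the slower child.
    return Cost(
        value=left.value + right.value,
        work=left.work + right.work + 1,
        span=max(left.span, right.span) + 1,
    )


result = fork_join_sum(list(range(8)))
print(result.value, result.work, result.span)
```

Under these assumed unit costs, work grows linearly with the input length while span grows with the depth of the fork tree, which is the distinction the abstract draws between the two measures.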

arXiv.org e-Print Archive

Crossref

Decentralized list scheduling

Author: A. Robison
C. Chekuri
D. Traoré
Denis Trystram
J. J. Hwang
J. Leung
L. Rudolph
M. A. Bender
M. Adler
M. Drozdowski
M. Frigo
M. Mitzenmacher
Marc Tchiboukdjian
N. S. Arora
Nicolas Gast
P. Berenbrink
P. Sanders
R. D. Blumofe
R. L. Graham
S. Kotz
T. Gautier
Y. Azar
Y. Robert
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Executing multithreaded programs efficiently

Author: Blumofe Robert D. (Robert David)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1995

Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. By Robert D. Blumofe. Includes bibliographical references (p. 135-145).

DSpace@MIT

Managing Storage for Multithreaded Computations

Author: Campbell L. Searle
Charles E. Leiserson
Robert D. Blumofe
Publication date: 01/01/1992

Multithreading has become a dominant paradigm in general-purpose MIMD parallel computation. To execute a multithreaded computation on a parallel computer, a scheduler must order and allocate threads to run on the individual processors. The scheduling algorithm dramatically affects both the speedup attained and the space used when executing the computation. We consider the problem of scheduling multithreaded computations to achieve linear speedup without using significantly more space per processor than required for a single-processor execution.

CiteSeerX

Hood: A User-Level Threads Library for Multiprogrammed Multiprocessors

Author: Dionisios Papadopoulos
Robert D. Blumofe

The Hood user-level threads library delivers efficient performance under multiprogramming without any need for kernel-level resource management, such as coscheduling or process control. It does so by scheduling threads with a non-blocking implementation of the work-stealing algorithm. With this implementation, the execution time of a program running with arbitrarily many processes on arbitrarily many processors can be modeled as a simple function of work and critical-path length. This model holds even when the program runs on a set of processors that arbitrarily grows and shrinks over time. In all cases, we observe linear speedup whenever the number of processes is small relative to the parallelism.
1 Introduction
As small-scale multiprocessors make their way onto desktops, the high-performance parallel applications that run on these machines will have to live alongside other applications, such as editors and web browsers. Similarly, users expect multiprocessor compute servers to supp..
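The work-stealing idea mentioned in this abstract can be sketched as a small sequential simulation. This is not Hood's code and it is not non-blocking: it is a toy model in which each worker owns a double-ended queue, takes its own tasks from the bottom, and, when idle, tries to steal one task from the top of a randomly chosen victim's queue. The function name `simulate_work_stealing`, the binary task tree used as a workload, and the round-based stepping are all assumptions made for illustration.

```python
import random
from collections import deque


def simulate_work_stealing(num_workers, depth, seed=0):
    """Simulate work stealing on a full binary task tree of the given depth.

    A task is represented by its remaining depth; running a task with
    depth d > 0 spawns two child tasks of depth d - 1.
    Returns (tasks executed per worker, number of simulated rounds).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    deques[0].append(depth)  # the root task starts on worker 0
    executed = [0] * num_workers
    rounds = 0
    while any(deques):
        rounds += 1
        for w in range(num_workers):
            if not deques[w]:
                # Idle worker: pick a random victim and steal from the top.
                victim = rng.randrange(num_workers)
                if victim != w and deques[victim]:
                    deques[w].append(deques[victim].popleft())
            if deques[w]:
                # Owner takes the most recently pushed task from the bottom.
                task = deques[w].pop()
                executed[w] += 1
                if task > 0:
                    deques[w].extend((task - 1, task - 1))
    return executed, rounds


per_worker, rounds = simulate_work_stealing(num_workers=4, depth=6)
print(per_worker, rounds)
```

Stealing from the opposite end to the one the owner uses means a thief tends to take older, typically larger pieces of work, which is the usual motivation for this deque discipline.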

CiteSeerX

Adaptive and Reliable Parallel Computing on Networks of Workstations

Author: Philip A. Lisiecki
Robert D. Blumofe
Publication date: 01/01/1996

In this paper, we present the design of Cilk-NOW, a runtime system that adaptively and reliably executes functional Cilk programs in parallel on a network of UNIX workstations. Cilk (pronounced “silk”) is a parallel multithreaded extension of the C language, and all Cilk runtime systems employ a provably efficient thread-scheduling algorithm. Cilk-NOW is such a runtime system, and in addition, Cilk-NOW automatically delivers adaptive and reliable execution for a functional subset of Cilk programs. By adaptive execution, we mean that each Cilk program dynamically utilizes a changing set of otherwise-idle workstations. By reliable execution, we mean that the Cilk-NOW system as a whole and each executing Cilk program are able to tolerate machine and network faults. Cilk-NOW provides these features while programs remain fault oblivious, meaning that Cilk programmers need not code for fault tolerance. Throughout this paper, we focus on end-to-end design decisions, and we show how these decisions allow the design to exploit high-level algorithmic properties of the Cilk programming model in order to simplify and streamline the implementation.

CiteSeerX